Comparison of Word-Based and Syllable-Based Retrieval for Tibetan

Authors

  • Paul G. Hackett
  • Douglas W. Oard
Abstract

Tibetan retrieval based on automatically segmented words is compared with the use of overlapping syllable n-grams using a known-item retrieval evaluation. The optimal span of fixed-length n-grams is found to be 2 syllables, and indexing words is found to be as effective as indexing syllable bigrams.
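As a rough illustration of what "overlapping syllable n-grams" means as an indexing unit, the sketch below splits a string on the Tibetan tsheg (U+0F0B), the mark that delimits syllables, and emits overlapping n-grams with a default span of 2. The function names, the short-text fallback, and the lack of any punctuation handling are assumptions made for this example; the abstract does not describe the authors' actual implementation.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)


def syllables(text):
    """Split text on the tsheg and drop empty pieces (e.g. after a trailing tsheg)."""
    return [s for s in text.split(TSHEG) if s]


def syllable_ngrams(text, n=2):
    """Return overlapping syllable n-grams, rejoined with the tsheg.

    Assumption for this sketch: text with fewer than n syllables is
    indexed as a single term rather than being dropped.
    """
    syls = syllables(text)
    if not syls:
        return []
    if len(syls) < n:
        return [TSHEG.join(syls)]
    return [TSHEG.join(syls[i:i + n]) for i in range(len(syls) - n + 1)]


if __name__ == "__main__":
    # Placeholder Latin letters stand in for Tibetan syllables.
    sample = TSHEG.join(["a", "b", "c", "d"])
    for term in syllable_ngrams(sample, n=2):
        print(term)
```

With n=2, a four-syllable string yields three index terms, each sharing one syllable with its neighbor. This overlap is what lets bigram indexing work without a word segmenter.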


Similar resources

Tibetan Multi-word Expressions Identification Framework Based on News Corpora

This paper presents an identification framework for extracting Tibetan multi-word expressions. The framework includes two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao’s N-gram statistical algorithm and Statistical Substring Reduction Algorithm. In the second phase, the Tibetan MWEs are identified by the proposed framework wh...


Using syllable-based indexing features and language models to improve German spoken document retrieval

Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to constitute a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used t...


Stemming and Segmentation for Classical Tibetan

Tibetan is a monosyllabic language for which computerized language tools are largely lacking. We describe the development of a syllable stemmer for Tibetan. The stemmer is based on a set of rules that strive to identify the vowel, the core letter of the syllable, and then the other parts. We demonstrate the value of the stemmer with two applications: determining stem similarity of two syllables...


Connected Component Based Word Spotting on Persian Handwritten image documents

Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...


Syllable-based Language Models in Speech Recognition for English Spoken Document Retrieval

The spoken content of audio/visual collections such as TV or radio archives is an information resource of enormous potential. The challenge is to develop methods that will make it possible to browse or search these collections. The experimental results presented in this paper demonstrate that syllable-level transcripts provide an important supplement to conventional word-level transcripts for t...




Publication date: 2000